Add TSan stress suite (PR B for TSan CI)#2403
Open
cuihtlauac wants to merge 4 commits intomirage:eiofrom
Open
Conversation
New .github/workflows/tsan.yml runs the test suite under ThreadSanitizer on a nightly cron (plus manual and opt-in `tsan` label on PRs). Runs the default suite alongside a new @tsan-stress alias, with halt_on_error=0 so a single run surfaces every race TSan observes. Reports are uploaded as a workflow artifact. The existing multicore and QCheck-STM tests become scalable via env vars: IRMIN_STM_ITER, IRMIN_STM_PACK_ITER, IRMIN_MULTICORE_DOMAINS, IRMIN_MULTICORE_ITER. Defaults match prior behaviour, so normal `dune runtest` is unchanged. The @tsan-stress alias (test/irmin-pack/test_tsan_stress/) ships as an empty dispatcher; per-hotspot scenarios (dict refill, irmin_mem cache, watch globals, fs pool, append_only_file buffer) land in a follow-up PR. This adds detection only; no src/ changes. Expect the first nightly run to surface several known races from mirage#2397 — that output is the baseline for follow-up fixes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dune @fmt / ocaml-ci's lint-fmt rejected two short match/if bindings that fit on a single line. No semantic change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pins reproduction recipe for TSan race-hunting to the tarides ocaml-devcontainer-tsan image (OCaml 5.4 with a pre-built +tsan switch). VS Code "Reopen in Container" or `devcontainer up` then `opam install --deps-only .` is enough to iterate on the @tsan-stress suite locally, instead of building a +tsan switch from scratch. The CI workflow (.github/workflows/tsan.yml) is authoritative and does not use this config — the devcontainer is dev convenience only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five new scenarios under test/irmin-pack/test_tsan_stress/ target mutable state hotspots that the existing multicore + QCheck-STM tests do not exercise across domains: - stress_mem_cache: races the global Hashtbl.t cache captured by Irmin_mem.Read_only.v (irmin_mem.ml:44) and the KMap mutable in the Read_only instance. - stress_watch: races the listen_dir_hook ref in watch.ml:28-29. - stress_ao_buf: races the unguarded Buffer.t in the rw_perm of Append_only_file (append_only_file.ml). - stress_dict: races the two unguarded Hashtbl.t caches plus last_refill_offset in dict.ml. - stress_fs_pool: races the shared Eio_pool instances in irmin_fs_unix.ml (mkdir_pool, openfile_pool). mem and watch produce clean TSan data-race warnings pointing at the exact hotspot; ao, dict and fs cause memory corruption fast enough that TSan fires via SEGV before it can write a detailed report — the SEGV itself is the race signal. The @tsan-stress dune alias runs each scenario in its own process via `|| true` so that a race-induced crash in one does not suppress the others. The tsan.yml workflow counts both "WARNING: ThreadSanitizer" and "ERROR: ThreadSanitizer" as findings, and includes the @tsan-stress build-dir reports in the uploaded artifact. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fills in the
@tsan-stressalias shipped empty by #2402 (PR A). Five per-hotspot race-hunting scenarios land undertest/irmin-pack/test_tsan_stress/, each targeting mutable state that the existing multicore + QCheck-STM suites do not exercise across domains.Stacked on #2402. Target is
eio; please merge #2402 first. This branch contains #2402's commit plus two more.Scenarios and what they produce under TSan
stress_mem_cachesrc/irmin/mem/irmin_mem.ml:44,51WARNING: ThreadSanitizer) pointing atHashtbl.addin the global cachestress_watchsrc/irmin/watch.ml:33listen_dir_hook :=assignmentstress_ao_bufsrc/irmin-pack/io/append_only_file.ml(Buffer.tinrw_perm)stress_dictsrc/irmin-pack/io/dict.ml:25-31stress_fs_poolsrc/irmin-fs/unix/irmin_fs_unix.ml:75,87(Eio_pool)The
@tsan-stressalias runs each in its own process (main.exe <name> || truechained), so one scenario's crash does not suppress the others.Correspondence with #2397
After locally merging #2397 on top of
eio+ this PR, re-running the suite gave the following result (reported withIRMIN_TSAN_STRESS_ITERup to 500):stress_watchd050ed73e0 Replace global refs with Atomic.t in watch.mlstress_mem_cache289e6272ff Protect in-memory store cache with a mutex in irmin_memEio.Mutexaround the global cache Hashtbl eliminates races atirmin_mem.ml:51, but the per-instancemutable t : KMap.tfield is still unsynchronised. Warnings now appear atirmin_mem.ml:72(Read_only.find),:86(Append_only.add),:136–:142(Atomic_write.test_and_set). Downstream symptom: concurrentStore.set_exnstill producesIrmin.Tree.find_tree.findv: encountered dangling hashandset_exn: Failure after 13 attempts to retry.stress_ao_bufb690c5b2b1 Protect append-only file buffer with a mutexstress_dict788ad16a7b Protect dict hashtables with a mutexstress_ao_buf.stress_fs_poolec0c10e619protectsIrmin_fs.IO_membut notIrmin_fs_unix'smkdir_pool/openfile_poolirmin_fs_unix.Takeaway: #2397 is necessary but not sufficient. Two concrete gaps remain (
irmin_memKMap,irmin_fs_unixpools) plus two cases where the fix is applied but TSan output remains unclear. Tracking the remainder in a follow-up issue (#2404).Workflow counts SEGV-on-race as a finding
.github/workflows/tsan.ymlsummary step now tallies bothWARNING: ThreadSanitizerandERROR: ThreadSanitizeroccurrences. Reports from bothcwd/tsan-report.*and the dune build dir (_build/default/test/irmin-pack/test_tsan_stress/tsan-report.*) are uploaded to the workflow artifact.Development recipe
Reproducible locally via the
.devcontainer/devcontainer.jsonadded in the first commit:Or VS Code → "Reopen in Container" with the committed devcontainer config.
What this does not do
Context: #2401 Phase 6.
Test plan
tsanlabel to this PR → workflow runs → five scenarios each produce atsan-report.<pid>artifact.data-race warnings: N(≥2 —memandwatch) andsignal/SEGV-on-race errors: M(≥2 —ao,dict).tsan-reports-<run>artifact; confirmmemandwatchreports citeirmin_mem.ml:51andwatch.ml:33respectively in the stack trace.watchdisappears from the nightly report, and follow-up issue tracks the remaining gaps.🤖 Generated with Claude Code